1. Introduction


The  U.S.  Army  Research  Laboratory  is  using  a  knowledge  discovery  in  databases  (KDD) 
approach  to  find  patterns  and  structure,  if  any,  in  documentation  (intelligence  reports,  news 
articles,  etc.)  concerning  terrorist-related  events  (see  figure  1).  The  intent  is  to  expedite  the  KDD 
process  for  such  activity  so  that  it  can  be  disrupted  or  thwarted.  Sometimes  there  are  indicators, 
but often the relevant information is buried within a massive amount of other data. High-dimensional data (HDD) may increase the chances of finding an incorrect pattern. This so-called “curse of dimensionality” may include anomalies in the raw data caused by (1) sensor malfunction in extreme environmental conditions or (2) errors resulting from computer program code, such as the floor function approximation. There is also the challenge of possible multilingual data mining under a time constraint. For all of these reasons, and more, anticipating a terrorist event is an extremely difficult task.

This report addresses the stage in the KDD process from dimensionality reduction (DR) to interpretation, namely feature selection (FS), feature extraction (FE), and data-mining methods. The approach used here involves transforming unstructured text, such as that from the Global Terrorism Database (1), which is very high-dimensional, to a two- or three-dimensional (2-D/3-D) data representation suitable for visual analytics (VA) application. Exploratory data analytics (EDA), which is closely related to the field of data mining, is used to discover knowledge in the data.

The next section describes how an EDA problem in HDD space becomes an exploratory visual analytics (EVA) one in 2-D/3-D Euclidean space, that is, when a point set embedded in a high-dimensional geometric space is transformed to a visually based distribution shape or structure.

To  our  knowledge,  successive  application  of  FS  (section  2.1)  prior  to  FE  (section  2.2)  has  not 
been considered, and thus the usefulness is still being evaluated. FS is semantic-preserving,
while  FE  destroys  semantics  but  allows  us  to  examine  the  underlying  relationships  in  the  data 
even  though  the  meaning  of  the  variables  is  lost.  The  assumption  here  is  that  humans  most 
effectively  understand  HDD  as  2-D/3-D  objects/structure  in  Euclidean  space  (2). 

Section  3  discusses  EVA,  which  allows  for  3-D  geometric  manipulation  of  the  data.  For  a  2-D 
view,  affine  transformation(s)  is/are  followed  by  an  orthographic  projection  onto  an  arbitrary 
plane.  Sometimes  just  looking  at  the  data  from  different  views  reveals  something  interesting  or 
informative;  otherwise,  analyzing  the  same  unstructured  text  would  be  difficult. 
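
The view manipulation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the report's implementation: the function names (`rotation_about_z`, `affine`, `orthographic_projection`), the random point cloud, and the chosen viewing plane are all assumptions made for the example. It applies an affine map to 3-D points and then projects them orthographically onto a plane through the origin with an arbitrary normal.

```python
import numpy as np

def rotation_about_z(theta):
    """Return the 3x3 rotation matrix for an angle theta (radians) about z."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def affine(points, A, t):
    """Apply the affine map x -> A x + t to each row of an (n, 3) array."""
    return points @ A.T + t

def orthographic_projection(points, normal):
    """Project (n, 3) points onto the plane through the origin with the
    given normal and return their (n, 2) in-plane coordinates."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Pick a helper axis that is not parallel to the normal, then build
    # an orthonormal basis (u, v) that spans the viewing plane.
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper)
    u = u / np.linalg.norm(u)
    v = np.cross(n, u)
    return np.column_stack((points @ u, points @ v))

# Hypothetical 3-D point set standing in for reduced data.
rng = np.random.default_rng(0)
points_3d = rng.normal(size=(100, 3))

# Rotate by 45 degrees about z, translate along x, then view along (1, 1, 1).
moved = affine(points_3d, rotation_about_z(np.pi / 4), np.array([1.0, 0.0, 0.0]))
view_2d = orthographic_projection(moved, normal=[1.0, 1.0, 1.0])
```

Changing `normal` gives a different 2-D view of the same data, which is the sense in which looking from several directions may reveal structure.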

Finally,  we  conclude  and  suggest  where  future  efforts  can  be  made  for  a  more  effective  terrorist 
KDD. 
Figure 1. KDD process for terrorist data (adapted from Nieves and Cruz [5]).

2. Dimensionality Reduction for Data Visualization


Any underlying structure of data in HDD space (of dimension d) is interpreted by re-embedding the data into a lower-dimensional 2-D/3-D Euclidean space. The projection could be for a nonlinear manifold, which is locally linear but may be globally curved. The projection should remain representative of the original data so that information loss is minimized and properties are preserved. DR of HDD is done here by FS and/or FE.

Note that DR tries to exploit the typically lower intrinsic dimension (P) of the data, i.e., P < d. P is the minimum number of variables needed to account for the observed properties of the data, and it reveals the presence of topological structure. Ideally, the reduced dimension (D) will correspond to P. When P < D, where D is also the dimension of the embedding space, the data lie in a well-defined space.

2.1  Feature  Selection 

FS determines a subset of features (or attributes) for the HDD by removing irrelevant and redundant data. An example of a decision system M, which can be represented as a matrix of objects and attributes, is illustrated in table 1 (4). The search for a feature subset involves determining those features that are highly correlated with the decision attribute but uncorrelated with one another; in other words, it computes the smallest subset of conditional attributes that preserves the decision attribute. This subset is called a reduct. The example shown in table 2 was computed using rough set theory (RST).* In this case, we obtained a 50% reduction of conditional data, i.e., we could safely eliminate half of the conditional attributes without changing the value of e, i.e., Ve.
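
The idea of a reduct can be illustrated with a brute-force search over attribute subsets. The sketch below is an assumption-laden illustration, not the RST procedure used in the report: the decision table values are hypothetical (chosen only to have the same shape as described, objects 0 to 7 with conditional attributes a to d and decision attribute e), and the helper names `dependency` and `minimal_reducts` are invented for this example. A subset counts as a reduct candidate when it determines the decision attribute for as many objects as the full attribute set does.

```python
from itertools import combinations

# Hypothetical decision table (illustrative values only).
# Columns: conditional attributes a, b, c, d, then decision attribute e.
ATTRS = ("a", "b", "c", "d")
TABLE = [
    # a  b  c  d  e
    (1, 0, 2, 2, 0),
    (0, 1, 1, 1, 2),
    (2, 0, 0, 1, 1),
    (1, 1, 0, 2, 2),
    (1, 0, 2, 0, 1),
    (2, 2, 0, 1, 1),
    (2, 1, 1, 1, 2),
    (0, 1, 1, 0, 1),
]

def dependency(subset):
    """Fraction of objects whose decision value is fully determined by the
    attributes in `subset` (the rough-set dependency degree)."""
    idx = [ATTRS.index(name) for name in subset]
    classes = {}
    for row in TABLE:
        key = tuple(row[i] for i in idx)          # indiscernibility class
        classes.setdefault(key, []).append(row[-1])
    consistent = sum(len(d) for d in classes.values() if len(set(d)) == 1)
    return consistent / len(TABLE)

def minimal_reducts():
    """Return all smallest attribute subsets that preserve the dependency
    degree of the full conditional attribute set."""
    full = dependency(ATTRS)
    for size in range(1, len(ATTRS) + 1):
        found = [s for s in combinations(ATTRS, size) if dependency(s) == full]
        if found:
            return found
    return []

reducts = minimal_reducts()
print(reducts)
```

For this toy table the smallest subsets that preserve the decision have two of the four conditional attributes, which corresponds to a 50% reduction. Exhaustive search is exponential in the number of attributes, which is one reason heuristic searches are used on real HDD.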


Table 1. An example data table from Jensen and Shen (4).

Note: For objects 0 to 7, the conditional attributes are from a to d, and the decision attribute is e.


* Only the results are shown here. A detailed description of this example is given in Jensen and Shen (4).
Table 2. Reduced data set for table 1.

RST is an extension of conventional set theory and is thus discrete-based. Uncertainty takes the form of “indiscernibility” in rough set attribute reduction. However, “vagueness” of feature data, i.e., real-valued attributes, is not modeled.

Fuzzy-rough set theory (FRST) handles both discrete and continuous data. The implementation of fuzzy-rough feature selection (FRFS) used in our work was written for the Waikato Environment for Knowledge Analysis (WEKA) from the University of Waikato, New Zealand. WEKA (5) is a popular open-source environment. In particular, we are using the Java jar for ant colony optimization (ACO), i.e., FRFS-ACO, to search the feature space. FRFS-ACO requires a graph representation in determining the reduct.

Ideally, FS will result in 2-D/3-D data for VA application. When the result has more than three components, the human visual system/brain combination usually becomes less effective quite quickly. In this case, an FE is then applied.

2.2  Feature  Extraction 

FE  irreversibly  transforms  data  semantics,  but  the  underlying  topology  of  the  structure,  if  any,  is 
preserved  and  can  be  further  examined.  In  topology,  the  concern  is  not  the  representation  of  an 
object  (or  structure)  in  space,  but  the  connectivity,  which  must  not  be  altered.  In  other  words, 
twisting,  deforming,  and/or  stretching  are  allowed,  but  no  tearing.  For  example,  a  2-D  circle  is 
topologically  equivalent  to  an  ellipse. 

Many FEs have been developed over the years. In figure 2, the first two approximations, principal component analysis (PCA) and classical metric multidimensional scaling (CMDS), are linear DRs (LDRs). An LDR is based on a linear combination of the feature data. LDRs keep similar data points close together (distance-preserving) when mapping from d to D. However, they cannot find curved manifolds since they are based on a Euclidean distance.
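
As a concrete instance of a linear DR, PCA can be sketched with a singular value decomposition. This is a minimal sketch under stated assumptions, not the toolchain used in this report: the function name `pca`, the synthetic data (200 points in d = 10 that lie near a 2-D linear subspace), and the noise level are all hypothetical. Each reduced coordinate is a linear combination of the original features.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Returns the reduced coordinates, the component directions (rows),
    and the variance explained along each retained component.
    """
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                # orthonormal directions
    explained = (S ** 2) / (X.shape[0] - 1)       # variance per component
    return Xc @ components.T, components, explained[:n_components]

# Hypothetical data: a 2-D latent structure linearly embedded in 10-D,
# with a small amount of noise added.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Y, components, variance = pca(X, n_components=2)
```

Because the synthetic data are nearly planar, two components reconstruct them almost exactly. On a curved manifold the same linear projection would fold distant parts of the data onto one another, which is the limitation noted above.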

Linear approximation
  principal component analysis (PCA)
  classical metric multidimensional scaling (CMDS)

Nonlinear approximation
  nonmetric MDS
  Isomap
  locally linear embedding (LLE)
    http://www.cs.nyu.edu/~roweis/lle/
  Laplacian eigenmaps (LE)
  stochastic neighbor embedding (SNE)/t-distributed SNE (t-SNE)
    http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
  neighbor retrieval visualizer (NeRV)/t-distributed NeRV (t-NeRV)
    http://research.ics.tkk.fi/mi/software/dredviz

Source: John A. Lee and Michel Verleysen, Université catholique de Louvain, Louvain-la-Neuve, Belgium (2011), http://jds2011.tn.refer.org/Pdflnvites/Lee.pdf

Figure  2.  Some  FEs  for  HDD  and  timeline. 


A nonlinear DR (NLDR) approximation, which is also called a manifold learner, preserves geodesic distances along the manifold, linear or nonlinear (see figure 3 for a comparison between Euclidean and geodesic distance). NLDRs include nonmetric MDS, Isomap, LLE, LE, SNE/t-SNE, and NeRV/t-NeRV. Most papers on an NLDR approximation demonstrate the algorithm using an artificial dataset, such as the Swiss roll or S-curve, and thus their application to real-world data is not straightforward.


Figure  3.  Comparison  of  Euclidean  vs.  geodesic  distance.  LDRs  use  metrics 
based  on  the  Euclidean  distance  between  two  points,  while  the 
NLDRs  are  based  on  geodesic  distance.  An  NLDR  successfully 
unrolls  the  curved  manifold,  whereas  an  LDR  fails. 
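
The difference between the two distances can be checked numerically on a simple curved manifold. The following sketch is an illustration only: it samples a half circle (an assumed stand-in for a curved manifold such as the Swiss roll), and approximates geodesic distance by shortest paths through a k-nearest-neighbor graph, in the spirit of Isomap. The function names and the choice k = 2 are assumptions for this example.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def graph_geodesic(X, k=2):
    """Approximate geodesic distances by shortest paths in a symmetric
    k-nearest-neighbor graph (Floyd-Warshall)."""
    D = pairwise_euclidean(X)
    n = len(X)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    # Column 0 of the argsort is the point itself, so skip it.
    nearest = np.argsort(D, axis=1)[:, 1:k + 1]
    for i in range(n):
        G[i, nearest[i]] = D[i, nearest[i]]
        G[nearest[i], i] = D[i, nearest[i]]
    # Relax all pairs through each intermediate node m.
    for m in range(n):
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return D, G

# Half circle of radius 1: a 1-D manifold curved in 2-D.
theta = np.linspace(0.0, np.pi, 50)
arc = np.column_stack((np.cos(theta), np.sin(theta)))

euclid, geodesic = graph_geodesic(arc, k=2)
print(euclid[0, -1], geodesic[0, -1])
```

The two end points of the arc are a straight-line distance of 2 apart, while the graph distance between them approaches the arc length pi as the sampling becomes denser. A method built on the straight-line distance treats the end points as closer than they are along the manifold.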

A recent research paper (6) suggests that manifold learners may not be the best DRs for data visualization. The last two methods for NLDR in figure 2, namely SNE/t-SNE and NeRV/t-NeRV, are specifically designed for data visualization and have been used with real-world data; NeRV is an MDS for detecting local structures, i.e., a local MDS (LMDS). That paper also states that SNE is a special case of NeRV (λ = 1 in equation 1 of the paper).
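
The special-case relationship can be made concrete with a small numerical sketch. It assumes, as is commonly stated for NeRV, that the cost is a weighted sum of two Kullback-Leibler divergences between neighborhood distributions in the input space (P) and in the embedding (Q), with a tradeoff parameter called `lam` here. The fixed Gaussian width `sigma`, the function names, and the random data are simplifications introduced for illustration and are not taken from the cited paper or its software.

```python
import numpy as np

def neighbor_probabilities(X, sigma=1.0):
    """Row-stochastic matrix of Gaussian neighborhood probabilities.
    Entry (i, j) is the probability of picking j as a neighbor of i."""
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(axis=-1)
    logits = -sq / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # no self-neighbors
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def kl_sum(P, Q, eps=1e-12):
    """Sum over points of KL(P_i || Q_i), ignoring the diagonal."""
    off = ~np.eye(len(P), dtype=bool)
    return float(np.sum(P[off] * np.log((P[off] + eps) / (Q[off] + eps))))

def tradeoff_cost(P, Q, lam):
    """Weighted sum of the two KL directions (assumed NeRV-style form)."""
    return lam * kl_sum(P, Q) + (1.0 - lam) * kl_sum(Q, P)

rng = np.random.default_rng(2)
X_high = rng.normal(size=(30, 8))     # hypothetical high-dimensional points
Y_low = rng.normal(size=(30, 2))      # hypothetical 2-D embedding

P = neighbor_probabilities(X_high)
Q = neighbor_probabilities(Y_low)

sne_like = tradeoff_cost(P, Q, lam=1.0)   # only KL(P || Q) remains
mixed = tradeoff_cost(P, Q, lam=0.5)
```

With `lam = 1.0` only the KL(P || Q) term remains, which is the SNE-type objective; smaller values put more weight on the reverse divergence.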




3.  Exploratory  Visual  Analytics 


As mentioned in the previous section, EDA in HDD space is done statistically using WEKA (figure 4). Launching the Explorer application from the graphical user interface (GUI) provides for FRST attribute reduction of HDD, specifically an ant colony optimization (FRFS-ACO) search for a reduct as described by Jensen (7). For a reduct with more than three attributes, we then apply the neighbor retrieval visualizer (NeRV) (6).


Figure  4.  WEKA  GUI  for  data  mining  HDD  using  FRFS-ACO. 


NeRV  is  a  local  MDS.  Although  semantics  are  destroyed  by  a  feature  extraction,  the  topology  of 
the  structure  (or  the  random,  scattered  points)  can  then  be  inspected.  Remember  that  the  intent  is 
to  visually  examine  the  data  in  a  Euclidean  space,  i.e.,  EVA. 

The resulting scene is described declaratively for the Extensible 3D (X3D) application-programming interface (API). X3D is an International Organization for Standardization (ISO) specification for describing scene content, possibly distributed across the Web. The scene graph consists of a directed acyclic graph of X3D objects and has a hierarchical parent-child structure. In addition, the immersive profile for the X3D scene allows for navigation/interaction within the data. Details of all X3D nodes and attributes can be found at http://www.web3d.org/x3d/specifications/ISO-IEC-19775-X3DAbstractSpecification/; an excellent description of X3D nodes and concepts is also given by Brutzman and Daly (8).

In  2010,  X3D  nodes  were  tightly  coupled  with  the  HTML  document  object  model  (DOM)  tree 
(9)  for  some  Web  browsers,  such  as  Mozilla  Firefox  and  Google  Chrome.  The  result  was  an 
X3DOM  library  where  one  could  embed  X3D  models  directly  into  a  Web  page  without  having  to 
write  any  JavaScript  code.  X3DOM  uses  the  WebGL  API  to  render  interactive  3-D  scenes 
natively  in  the  Web  browser. 
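
To show what a declarative scene of this kind might look like, the sketch below generates a Web page that embeds a 3-D point set as X3D nodes for X3DOM. It is a hedged illustration, not the scene used in this work: the function name `points_to_x3dom_html`, the page layout, the point color, and the two library URLs are assumptions and should be checked against the X3DOM distribution actually in use.

```python
import numpy as np

# Assumed locations of the X3DOM script and stylesheet; adjust as needed.
X3DOM_JS = "https://www.x3dom.org/download/x3dom.js"
X3DOM_CSS = "https://www.x3dom.org/download/x3dom.css"

def points_to_x3dom_html(points, title="Reduced data"):
    """Return an HTML page that declares the (n, 3) points as an X3D
    PointSet inside an X3DOM scene. No hand-written JavaScript is needed
    beyond loading the library."""
    coords = " ".join(f"{x:.4f} {y:.4f} {z:.4f}" for x, y, z in points)
    return f"""<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>{title}</title>
  <script src="{X3DOM_JS}"></script>
  <link rel="stylesheet" href="{X3DOM_CSS}">
</head>
<body>
  <x3d width="600px" height="400px">
    <scene>
      <shape>
        <appearance>
          <material emissiveColor="0 0 1"></material>
        </appearance>
        <pointSet>
          <coordinate point="{coords}"></coordinate>
        </pointSet>
      </shape>
    </scene>
  </x3d>
</body>
</html>
"""

# Hypothetical 3-D output of a feature extraction step.
rng = np.random.default_rng(3)
html = points_to_x3dom_html(rng.normal(size=(50, 3)))
```

Writing `html` to a file and opening it in a WebGL-capable browser should display the points in a scene that can be rotated and zoomed. The shape, appearance, and coordinate nodes form the parent-child hierarchy of the scene graph.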


4.  Conclusions  and  Future  Work 


EVA  in  an  HDD  space  for  a  timely  interpretation  remains  to  this  day  a  very  challenging  task, 
especially  for  terrorist-related  data.  Dr.  Nam  suggests  in  her  dissertation  (10)  that  our  perception 
in  3-D  is  learned  from  infancy,  and  that  it  is  essentially  nonexistent  for  higher  dimensions.  Thus 
it  becomes  more  difficult  in  time  to  reason  in  higher  dimensions. 

Both  feature  selection  and  feature  extraction,  if  necessary,  are  used  for  dimensionality  reduction 
of  HDD  for  data  visualization.  Declarative  X3D  and  X3DOM  are  then  used  for  VA  of  resultant 
data  in  either  the  latest  Mozilla  Firefox  or  Google  Chrome  Web  browser;  these  also  support 
WebGL  for  bringing  3-D  to  the  Web  browser  procedurally. 

Point  characterizations  constructed  from  2-D  orthogonal  views  of  HDD,  i.e.,  scatter  plot 
diagnostics  (scagnostics)  and  scatter  plot  matrix,  are  being  considered  (11).  This  approach  to 
VA of HDD is guided by a more rigorous statistical analysis.